1) Motivating local fits

1-1) Limitations of parametric regression

We've talked at great length about models where you specify some set of features. And what that results in is a model with fixed flexibility. There's a finite capacity to how flexible the model can be based on the specified features. But now we're gonna take a totally different approach. There are approaches that are called nonparametric approaches, and the ones that we're gonna look at in this module are called k-nearest neighbors and kernel regression. What these methods allow you to do is be extremely flexible. The implementations are very, very simple, and they allow the complexity of the models you infer to increase as you get more data. These really simple approaches are surprisingly hard to beat. But as we're gonna see, all of this relies on us having enough data to use these types of approaches.

To start with, let's talk about this idea of fitting a function globally versus fitting it locally.

Screenshot taken from Coursera 0:30

So let's imagine that we have the following data that represent observations of houses with a given square footage and their associated sales prices. And maybe, just for the sake of argument in this module, let's imagine that for small houses there tends to be a linear relationship between square feet and house value. But then that relationship tends to taper off, and there's not much change in house price with square feet. But then you get into this regime of really large houses where, as you increase the square feet, the prices can just go way up, because maybe they become very luxurious houses.

In the types of models that we've described so far, we've talked about fitting some function across the whole input space of square feet. So for example, if we assume a really simple model, like just a constant fit, we might get the following fit to our data.

Screenshot taken from Coursera 1:00

Or if we assume that we're gonna fit some line to the data, we might get the following or a quadratic function.

Screenshot taken from Coursera 1:00

And we might say, well, this data kinda looks like there's some cubic fit to it. And so the cubic fit looks pretty reasonable, but the truth is that this cubic fit is kind of a bit too flexible, a bit too complicated for the regions where we have smaller houses. It looks like it fits very well for our large houses but it's a little bit too complex for lower values of square feet.

Screenshot taken from Coursera 2:00

Because really, for this data, you could describe it as having a linear relationship, like I talked about, for low square feet values. Then just a constant relationship between square feet and value for some other region. And then maybe a quadratic relationship between square feet and house value when you get to these really, really large houses.

So this motivates this idea of wanting to fit our function locally to different regions of the input space. Or have the flexibility to have a more local description of what's going on than our models which did these global fits allowed.

Screenshot taken from Coursera 3:00

So what are we gonna do? We want to flexibly define our f(x), the relationship in this case between square feet and house value, to have this type of local structure. But let's say we don't want to assume that there are what are called structural breaks, certain change points where the structure of our regression is gonna change. In that case, you'd have to infer where those break points are. Instead, let's consider a really simple approach that works well when you have lots of data.

Screenshot taken from Coursera 3:20

2) Nearest neighbor regression

2-1) 1-Nearest neighbor regression approach

And this simple approach is called nearest neighbor regression.

So the idea of nearest neighbor regression is you have some set of data points, which are shown as these blue circles. And then, when you go to predict the value at any point in your input space, all you do is you look for the closest observation that you have, see what its observed output value is, and you just predict that your value is exactly equal to that value.

Okay, so this is really the simplest model you can think of. So here we're showing what the resulting fit would look like. The green line is the fit, and we have these little arrows showing which are the closest observations. So just to be clear, if I'm somewhere in this input space, I have some number of square feet for my house. I'm gonna look for the closest observation, and I'm gonna choose my predicted value for my house to be exactly the same as this other observation. So we can see that this leads to this idea of having local fits, where these local fits are defined around each one of our observations. And how local they are, how far they stretch, is based on where the other observations are placed. And this is called one nearest neighbor regression.

Screenshot taken from Coursera 1:00

And this is really what people do naturally. So if you think of some real estate agent out there and you're selling your house, and the real estate agent wants to assess the value of your house. What are they gonna do? Well that agent is gonna look at all the other houses that were sold and is gonna say, well what's the most similar house to your house and how much did it sell for? So implicitly what this real estate agent is doing is looking for your nearest neighbor, which house is most similar to yours, and then is gonna assess the value of your house as being exactly equal to the sales price of this other house. That's the best guess that your real estate agent has for the value of your house and how much it's likely to sell for on the market.

Screenshot taken from Coursera 2:00

So let's formalize this nearest neighbor method a bit.

  • We're gonna have some data set in our housing application. It's gonna be pairs of house attributes and values associated with each house, so the sales price of the house. And so we're gonna denote this as $(x_i, y_i)$ for some set of observations, i = 1 to capital N. So this is what we assume our data set is.
  • And then we assume that we have some query house which is not in our training data set. So this is some point, $x_q$, some house whose value we're interested in. And the first step of the nearest neighbor method is to find the closest other house in our dataset.
  • So specifically, let's call our x nearest neighbor the house that minimizes, over all of our observations i, the distance between $x_i$ and our query house, $x_q$. So to be clear, in the picture we had on the previous slide, this x nearest neighbor would just be that big pink house.
  • Then what we're gonna do is we're simply gonna predict the value of our query house, which is our big lime green house, to be exactly the value, the sales price, of this big pink house.

And so just to be clear, let's say this was the square feet of our green house. What we're gonna do is look for our nearest neighbor. We search along square feet and say, which house has the most similar square feet to mine? It happens to be this house, this pink house, which is, if there's space I'll just write it here, x nearest neighbor. And then we say, what is the value of that house, how much it sold for, which is y nearest neighbor. And that's exactly what we're gonna predict for our query house.
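
To make this concrete, here is a minimal sketch of 1-nearest neighbor prediction in Python. It isn't from the course materials; the array names and the use of plain 1-D Euclidean distance on square feet are assumptions for illustration.

```python
import numpy as np

def predict_1nn(x_query, X_train, y_train):
    """Predict the output of the single closest training observation."""
    distances = np.abs(X_train - x_query)   # 1-D Euclidean distance to every house
    nearest = np.argmin(distances)          # index of the nearest neighbor
    return y_train[nearest]                 # predict exactly its observed value

# toy example: square feet -> sale price
X_train = np.array([1000, 1500, 2000, 3500])
y_train = np.array([300000, 350000, 380000, 700000])
print(predict_1nn(1600, X_train, y_train))  # nearest house is 1500 sq ft, so 350000
```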

Screenshot taken from Coursera 4:00

But first, let's talk about what one nearest neighbor looks like in higher dimensions because so far, we've assumed that we have just one input like square feet and in that case, what we had to do was define these transition points. Where we go from one nearest neighbor to the next nearest neighbor. And thus changing our predicted values across our input space square feet. But how do we think about doing this in higher dimensions?

Well, what we can do is look at something that's called a Voronoi diagram or a Voronoi tessellation. And here, we're showing a picture of such a Voronoi tessellation, but just in two dimensions, though this idea generalizes to higher dimensions as well. And what we do is we're just gonna divide our input space into regions. So in the case of two dimensions, they're just regions in this two dimensional space. So I'm just gonna highlight one of these regions. And each one of these regions is defined by one observation.

And what defines the region is the following: take the observation in the region, call it $x_i$, and any other point in the region, call it x. Then this point x is closer to $x_i$ than to any other $x_j$, for j not equal to i, meaning any other observation.

So in pictures, what we're saying is that the distance from x to $x_i$ is less than the distance to any of the other observations in our dataset. So what that means is, let's say that x is our query point, so let's now put a subscript q. If $x_q$ is our query point, then when we go to predict the value associated with $x_q$, we're just gonna look at what the value was associated with $x_i$, the observation that's contained within this region. So this Voronoi diagram might look really complicated, but we're not actually going to explicitly form all these regions. All we have to do is be able to compute the distance between any two points in our input space and determine which observation is the closest.

Screenshot taken from Coursera 6:00

2-2) Distance metrics

How are we defining distance? Well, in 1-D it's really straightforward, because our distance on a continuous space is just gonna be Euclidean distance, where we take our input $x_i$ and our query $x_q$ and look at the absolute value of the difference between these numbers. So, these might represent square feet for two houses, and we just look at the absolute value of their difference.

But when we get to higher dimensions, there are lots of interesting distance metrics that we can think about. And let's just go through one that tends to be pretty useful in practice, where we're going to simply weight the different dimensions differently but use standard Euclidean distance otherwise. So, it looks just like Euclidean distance, but we're going to have different weightings on our different dimensions.

Screenshot taken from Coursera 0:30

So, just to motivate this, going back to our housing application, you could imagine that you have some set of different inputs, which are attributes of the house, like how many bedrooms it has, how many bathrooms, square feet, all our standard inputs that we've talked about before. But when we think about saying which house is most similar to my house, well, some of these inputs might matter more than others when I think about this notion of similarity. So, for example, the number of bedrooms, number of bathrooms, and square feet of the house might be very relevant, much more so than what year the house was renovated, when I'm going to assess the similarity.

Screenshot taken from Coursera 1:00

So, to account for this, what we can do is define what's called a scaled Euclidean distance, where we take the distance between this vector of inputs, let's call it $x_j$, and the vector of inputs associated with our query house, $x_q$, and we're gonna look component-wise at their difference squared. But then we're gonna scale it by some number, and then we're gonna sum this over all our different dimensions.

So, in particular I'm using this letter a to denote the scaling. So, $a_d$ is the scaling on our dth input, and what this is capturing is the relative importance of these different inputs in computing this similarity.

And after we take the sum of all these squares, we're gonna take the square root. And if all these a values were exactly equal to 1, meaning that all our inputs had the same importance, then this just reduces to standard Euclidean distance.
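
Here is a minimal Python sketch of this scaled Euclidean distance. The feature layout and the specific weights a are made up for illustration; with all weights equal to 1, it reduces to standard Euclidean distance, as described above.

```python
import numpy as np

def scaled_euclidean(x_j, x_q, a):
    """sqrt( sum_d a_d * (x_j[d] - x_q[d])^2 ), with a_d the weight on dimension d."""
    diff = x_j - x_q
    return np.sqrt(np.sum(a * diff ** 2))

# hypothetical inputs: [bedrooms, bathrooms, sq ft, year renovated]
x_j = np.array([3.0, 2.0, 1800.0, 1995.0])
x_q = np.array([4.0, 2.5, 2000.0, 2010.0])
a   = np.array([1.0, 1.0, 1.0, 0.0])            # ignore 'year renovated' entirely
print(scaled_euclidean(x_j, x_q, a))            # distance using only the first 3 inputs
print(scaled_euclidean(x_j, x_q, np.ones(4)))   # standard Euclidean distance
```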

So, this is just one example of a distance metric we can define at multiple dimensions, there's lots and lots of other interesting choices we might look at as well.

Screenshot taken from Coursera 2:00

Let's visualize what impact different distance metrics have on our resulting nearest neighbor fit. So, if we just use standard Euclidean distance on the data shown here, we might get the image shown on the right, where the different colors indicate what the predicted value is in each one of these regions. Remember, in each region, any point has exactly the same predicted value because it has the same nearest neighbor. So, that's why we get these different regions of constant color.

But if we look at the plot on the left hand side, where we're using a different distance metric, what we see is that we're defining different regions. Again, those regions mean that any point within that region is closer to the one data point lying in that region than to any of the other data points in our training data set. But the way this distance is defined is different, so the regions look different. For example, with this Manhattan distance, just think of New York and driving along the streets of New York: it's measuring distance along axis-aligned directions, so it's the distance along the x direction plus the distance along the y direction, which is a different distance than our standard Euclidean distance.

Screenshot taken from Coursera 4:00

2-3) 1-Nearest neighbor algorithm

Well we've talked about what 1-Nearest Neighbors is doing intuitively, now let's talk about how to actually implement 1-Nearest Neighbors in practice.

So to perform our nearest neighbor search, what we have to do is specify a query input, which will be, for example, some house whose value we're interested in assessing. And then we also need to provide our data set, our set of training houses. And we also need to specify a distance metric, which is our way of measuring the similarity between houses. And then the output of our one nearest neighbor search is gonna be the most similar house, and we can use the value associated with that house as our prediction for our query house.

Screenshot taken from Coursera 0:30

So more explicitly, in our one nearest neighbor algorithm, we can initialize what I'm calling the distance to the nearest neighbor to be infinity and initialize our closest house to be the empty set. Then what we do is step through every house in our dataset, and we compute the distance from our query house to the house that we're at in our current iteration, and check whether that distance is less than the current distance to our nearest neighbor. Since that distance is initially infinity, at the first iteration you're definitely gonna have a closer house.

So the first house that you search over you're gonna choose as your nearest neighbor for that iteration. But remember, you're gonna continue on. And so what you're gonna do is, if the distance is less than the distance to your nearest neighbor, you're gonna set your current nearest neighbor equal to that house, or that x. And then you're gonna set your distance to nearest neighbor equal to the distance that you had to that house. And then you're just gonna iterate through all your houses. And in the end what you're gonna return is the house that was most similar. And then we can use that for prediction to say that the value associated with that house is the value we're predicting for our query house.
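
Here is that loop written out as a Python sketch, not the course's own implementation; the distance argument is whatever metric you've chosen (Euclidean, scaled Euclidean, and so on).

```python
import math

def nearest_neighbor_search(x_query, X_train, y_train, distance):
    """Scan all N observations; return the closest one, its value, and its distance."""
    dist_to_nn = math.inf                  # distance to current nearest neighbor
    closest_x, closest_y = None, None      # current nearest neighbor
    for x_i, y_i in zip(X_train, y_train):
        d = distance(x_i, x_query)
        if d < dist_to_nn:                 # found a closer house
            closest_x, closest_y = x_i, y_i
            dist_to_nn = d
    return closest_x, closest_y, dist_to_nn

# the prediction for the query house is simply closest_y
```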

Screenshot taken from Coursera 2:00

So let's look at what this gives us on some actual data. Here, we drew some set of observations from a true function that had this kind of curved shape to it. The blue line indicates the true function that was used to generate this data, and the green represents our one nearest neighbor fit. And what you can see is that the fit looks pretty good for data that's very dense in our input space, so dense in x: we get lots of observations across our whole input space.

Screenshot taken from Coursera 2:30

But if we just remove some observations in a region of our input space, things start to look not as great, because nearest neighbor really struggles to interpolate across regions of the input space where you don't have any observations, or you have very few observations.

Screenshot taken from Coursera 2:45

And likewise, if we look at a data set that's much noisier, we see that our nearest neighbor fit is also quite wild. This looks exactly like the types of plots we showed when we talked about models that were overfitting our data. So what we see is that one nearest neighbor is also really sensitive to noise in the data.

Screenshot taken from Coursera 3:10

3) k-Nearest neighbors and weighted k-nearest neighbors

3-1) k-Nearest neighbors regression

What this motivates us to look at is something called k-nearest neighbors, which is a very simple extension of one nearest neighbors. And, again going back to our housing application, it's what real estate agents do. I said they just look at the nearest house, but that's not quite true; they tend to go out and get a set of what are called comps.

So, instead of just looking for the closest, most similar house to yours, they're going to collect a set of houses that are similar, look at the values for that set of houses, and say, well, your house value should be somehow related to this set of houses. Because the agent understands that, for the single most similar house, there might have been something strange about the value it got. Maybe a buyer was really drawn to some specific feature of that house, and that drove up the cost. Or maybe it sold at a strange time in the year, and maybe that gave it an artificially low value. So looking at a larger set of houses to assess the value of yours can give you a much more reliable estimate.

Screenshot taken from Coursera 1:00

So, formally k nearest neighbors is gonna take the same data set that we had before and a query point. And what it's gonna do, is it's gonna find the k closest observations in our data set. So specifically let's call it x nearest neighbor 1, x nearest neighbor 2, all the way to x nearest neighbor k. I'm gonna assume that I've ordered these nearest neighbors so that x nearest neighbor k, is the furthest, the least similar out of this set to the query house.

This set is defined such that for any $x_i$ not in my nearest neighbor set, the distance between $x_i$ and my query point $x_q$ is greater than or equal to the distance between my kth nearest neighbor and $x_q$. $$distance(x_i, x_q) \geq distance(x_{NN_k}, x_q)$$

  • So what this means is that, every other observation in the data set, is further away than my kth nearest neighbor. That's what defines it as my kth nearest neighbor.

So then what do I do for my prediction? Because when we were doing one nearest neighbor regression, we were just taking our nearest neighbor looking at the output associated with that nearest neighbor and saying that was my predicted value. But now I have a set of nearest neighbors. And so I need to decide what I'm gonna do with these nearest neighbors and their associated outputs to form my prediction.

And what k nearest neighbors does is it says that the predicted value associated with my query point is simply gonna be the average value of all the nearest neighbor outputs. $$\hat y_q = \frac{1}{k}(y_{NN_1} + y_{NN_2} + ... + y_{NN_k}) = \frac{1}{k} \sum_{j=1}^k y_{NN_j}$$

So k nearest neighbors, like I said, is a very simple variant or generalization of one nearest neighbors.

Screenshot taken from Coursera 3:00

Now, let's talk about how we implement our K nearest neighbor search in practice. And the setup is exactly like it was for one nearest neighbor, but instead of outputting just the most similar house, we're gonna output a set of similar houses, specifically k similar houses.

Screenshot taken from Coursera 3:30

For completeness, I've included what the k nearest neighbor algorithm looks like, where the one key insight here is to maintain a sorted queue of the k nearest neighbors as you're going through the different steps of your search. So let's just walk through this algorithm here (a Python sketch follows the list below).

  • What we're gonna do is initialize what I'm calling the distances to the k nearest neighbors, which is a list of the current distances to the k nearest neighbors I have at the current iteration. To start with, we can just take the first k houses in our data set and initialize those as our k nearest neighbors, but we're gonna sort the distances to these first k houses, and in whatever order those distances sort, we're gonna likewise sort the first k houses to be our current estimate of our nearest neighbors.
  • Then we're gonna cycle through houses k + 1 to capital N. What we're gonna do is compute the distance of this new house that we haven't seen before to our query house. And if that distance is less than the distance to my kth nearest neighbor, the house that's furthest away, then what I'm going to do is go through my sorted list. Because remember, just because this new house is closer than the kth nearest neighbor, that doesn't mean that this new house becomes my kth nearest neighbor; it could be closer than my second nearest neighbor, or my fourth.
  • So what I need to do is go through and find which index this house should go into. Meaning there's some point at which the houses at positions one, two, up to j minus one in my list are actually closer than this house that I'm looking at now, but this house is closer than all the other nearest neighbors in my set. And what I'm gonna do, now that we have this index j, is go and insert this house, which is one of our current estimates of a nearest neighbor, into the queue. We simply knock off the kth nearest neighbor, which we know was our worst estimate of a nearest neighbor so far, and insert this house into the queue. So one step is just shifting the indices in our list of houses, knocking off that kth, last house, and another step is doing exactly the same thing for the distances associated with each of these nearest neighbors.
  • Finally, this is where we're inserting the new nearest neighbor: in that jth slot, we put the distance to this newly found house, and in the jth slot of our list of houses, we put the house itself. So once we cycle through all the observations in our data set, what we end up with is a set of the k most similar houses to the query house. And like we described before, when we wanna go and predict the value associated with our query house, we simply average over the values associated with these k nearest neighbors.
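
Here is a Python sketch of that search, keeping parallel lists sorted by distance to the query. It's not the course's implementation (a bisection- or heap-based queue would be more efficient), but it mirrors the steps described above.

```python
import numpy as np

def knn_search(x_query, X_train, y_train, k, distance):
    """Return the k nearest neighbors of x_query as (houses, values, distances)."""
    # initialize with the first k observations, sorted by distance to the query
    first_dists = [distance(x, x_query) for x in X_train[:k]]
    order = np.argsort(first_dists)
    nn_dist = [first_dists[i] for i in order]         # sorted distances, nearest first
    nn_x = [X_train[i] for i in order]
    nn_y = [y_train[i] for i in order]

    # scan the remaining observations
    for x_i, y_i in zip(X_train[k:], y_train[k:]):
        d = distance(x_i, x_query)
        if d < nn_dist[-1]:                           # closer than the current kth neighbor
            j = next(idx for idx, dj in enumerate(nn_dist) if d < dj)  # insertion slot
            nn_dist.insert(j, d); nn_x.insert(j, x_i); nn_y.insert(j, y_i)
            nn_dist.pop(); nn_x.pop(); nn_y.pop()     # knock off the old kth neighbor
    return nn_x, nn_y, nn_dist

def knn_predict(x_query, X_train, y_train, k, distance):
    """k-NN regression: average the values of the k nearest neighbors."""
    _, nn_y, _ = knn_search(x_query, X_train, y_train, k, distance)
    return sum(nn_y) / k
```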

Screenshot taken from Coursera 4:00

3-2) k-Nearest neighbors in practice

Now let's look at k-Nearest neighbors in practice.

We're gonna look at exactly the same data set we looked at for one nearest neighbor, where we showed that really noisy fit. And what we see is that things look a lot better here. So, in this yellow box, what we're showing are all the nearest neighbors for a specific target point $x_0$. There's a red line going up from our target, our query point, to a yellow box, and this yellow box has all the nearest neighbor observations highlighted as red circles instead of grey circles for this one query point. And if we think about averaging the values of all these points, that results in the value of the green line at this target point $x_0$. And we can repeat this for every value of our input space, and that's what's gonna give us this green curve.

And so what we see here is that this fit looks much more reasonable than that really noisy one nearest neighbor fit we showed before. But one thing that I do want to point out is that we get these boundary effects. The same is true if we have limited data in any region of the input space, but in particular at the boundary, the reason we get these constant fits is that our nearest neighbors are exactly the same set of points for all of these different input points. Because if I'm all the way over at the boundary, all my nearest neighbors are the k points to either the right or the left of me, depending on which boundary I'm at. And if I shift over one point, I still have the same set of nearest neighbors, obviously except for the one point that is the query point; aside from that, it's basically the same set of values that you're using at each of these points along the boundary. But overall, we see that we've been able to cope with some of the noise a lot better than we did in the one nearest neighbor situation.

Screenshot taken from Coursera 1:00

But beyond the boundary issues, there's another fairly important issue with the k nearest neighbors fit, which is the fact that you get discontinuities. So, if you look closely at this green line, there are a bunch of jumps between values. And the reason you get those jumps is the fact that as you shift from one input to the next, a nearest neighbor is either completely in or out of the window. So, there's this effect where all of a sudden a nearest neighbor changes, and then you're gonna get a jump in the predicted value.

Screenshot taken from Coursera 2:00

And so the overall effect on predictive accuracy might actually not be that significant. But there are some reasons we don't like fits with these types of discontinuities. First, visually, maybe it's not very appealing. But let's think in terms of our housing application, where what this means is that if we go from a house of, for example, 2640 square feet to a house of 2641 square feet, to you, that probably wouldn't make much of a difference in assessing the value. But if you have a discontinuity between these two points, what it means is there's a jump in the predicted value.

So, I take my house with some predicted value, I just add one square foot, and the predicted value would have a perhaps significant increase or decrease. And so that is not very attractive in applications like housing. And more generally, we just don't tend to believe these types of fits.

Screenshot taken from Coursera 3:00

3-3) Weighted k-nearest neighbors

So this leads us to consider something called weighted k-nearest neighbors. And what weighted k-nearest neighbors does is it downweights the neighbors that are further from the specific query point, or target point. And the effect that has is that as we're shifting from target point to target point, when a neighbor jumps in or out of our set of nearest neighbors, the effect of that isn't as significant. Because when I'm computing what my prediction is at a given value, I've already downweighted that neighbor, so when it drops out, it was not that significant in forming my predicted value to start with.

So, in particular, when we talk about our predicted value, we're gonna weight each one of our nearest neighbor observations by some amount that I'll call $C_{qNN_j}$: the subscript q refers to the query point, and $NN_j$ indexes which nearest neighbor that weight is associated with.

Screenshot taken from Coursera 0:30

The question is how do we wanna define these weights?

Well, if the distance between my query point and its nearest neighbor is really large, then we want this constant that we're multiplying by to be very small, because we wanna downweight that observation. But in contrast, if the distance is very small, meaning this house is very similar to my house, I want that constant to be larger.

So one simple choice is simply to take our weights to be equal to 1 over the distance between these points. $$C_{q NN_j} = \frac{1}{distance(x_{NN_j},x_q)}$$ So this will have the effect that when the distance is larger the weight is smaller, and when the distance is smaller the weight is larger.
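
As a small illustration, here is how that inverse-distance weighting could be used to form the weighted k-NN prediction, sketched in Python on top of the knn_search from earlier. Normalizing by the sum of the weights (so the prediction is a weighted average) is an assumption consistent with the kernel-weighted averages described below; the small eps guards against a zero distance.

```python
import numpy as np

def weighted_knn_predict(x_query, X_train, y_train, k, distance, eps=1e-12):
    """Weighted k-NN prediction with weights c = 1 / distance to the query."""
    _, nn_y, nn_dist = knn_search(x_query, X_train, y_train, k, distance)
    weights = 1.0 / (np.asarray(nn_dist) + eps)   # larger distance -> smaller weight
    return np.sum(weights * np.asarray(nn_y, dtype=float)) / np.sum(weights)
```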

Screenshot taken from Coursera 1:00

More generally, we can think of defining these weights using something that's called a kernel, where here in this slide we're gonna focus on the simplest case of what are called isotropic kernels. And these kernels are just a function of the distance between any point and the target point, the query $x_q$.

And in this figure we're showing a couple examples of commonly used kernels. What we see is this kernel is defining how the weights are gonna decay, if at all, as a function of the distance between a given point and our query point.

So as an example, one commonly used kernel is something called the Gaussian kernel. And what we see is that this kernel takes the distance between $x_i$ and $x_q$, looks at the square of that distance, and then divides that by a parameter lambda, and that lambda is going to define how quickly the function decays. Importantly, this whole thing that I described is negated and then exponentiated, so that all these weights are positive.

  • But I wanna emphasize that this parameter lambda that all of these kernels have, that's why I've written $Kernel_{\lambda}$ (kernel sub lambda), defines how quickly we're gonna have the weights decay as a function of this distance.
  • And another thing that's worth noting is the fact that for this Gaussian kernel I've mentioned here, the weights never go exactly to zero, they just become very small as the distance increases. But for the other kernels I've shown on this plot, once you get beyond this minus lambda to lambda region, the weights go exactly to zero (a small sketch of these kernels follows below).
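
As a reference, here is a Python sketch of a few of these isotropic kernels as functions of the distance d and bandwidth lambda. The Gaussian form follows the description above; the Epanechnikov and boxcar forms are standard definitions (with normalizing constants dropped, since the weights get normalized anyway) rather than something read off the slide.

```python
import numpy as np

def gaussian_kernel(d, lam):
    """Weights decay smoothly and never reach exactly zero."""
    return np.exp(-(d ** 2) / lam)

def epanechnikov_kernel(d, lam):
    """Quadratic decay; weight is exactly zero once |d| > lambda (bounded support)."""
    u = d / lam
    return np.maximum(0.0, 1.0 - u ** 2)

def boxcar_kernel(d, lam):
    """Constant weight inside the window |d| <= lambda, zero outside."""
    return (np.abs(d) <= lam).astype(float)
```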

Screenshot taken from Coursera 3:00

So what we just described was assuming that we were just looking at a single input like square feet, and thus our distance was just our standard Euclidean distance, looking at the absolute value of the difference between the two inputs. But more generally, you can use the same kernels in higher dimensional spaces, where again you just compute the distance, however you've defined it in your multivariate space, and plug that into the kernel to define your weight.

Screenshot taken from Coursera 4:00

4) Kernel regression

4-1) From weighted k-NN to kernel regression

So, these ideas lead directly to a related, but slightly different, approach called kernel regression.

Let's start by just recalling what we did for weighted k-NN. And what we did was we took our set of k-NN, and we applied some weight to each nearest neighbor based on how close that neighbor was to our query point.

Screenshot taken from Coursera 0:10

Well for kernel regression, instead of just weighting some set of k-NN, we're gonna apply weights to every observation in our training set. So in particular, our predicted value is gonna sum over every observation, and we're gonna have a weight $C_{qi}$, (C sub qi) on each one of these data points. And this weight is gonna be defined according to the kernels that we talked about for weighted k-NN.

And in statistics, this is often called Nadaraya-Watson kernel weighted averages.
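
Putting the pieces together, here is a minimal Python sketch of this kernel-weighted average for a single 1-D query point, reusing the kernel functions sketched earlier. This is an illustration, not the course's code; for a kernel with bounded support, a query point with no training data inside the window has an undefined prediction, which is handled here by returning NaN.

```python
import numpy as np

def kernel_regression(x_query, X_train, y_train, kernel, lam):
    """Kernel-weighted average over ALL training observations."""
    distances = np.abs(X_train - x_query)   # 1-D distance; swap in any metric you like
    weights = kernel(distances, lam)        # the weight c_qi for every observation i
    if weights.sum() == 0.0:                # no observation inside the kernel's support
        return float("nan")
    return np.sum(weights * y_train) / np.sum(weights)

# tracing out the green fit: evaluate at a grid of query points
# y_hat = [kernel_regression(x0, X_train, y_train, epanechnikov_kernel, lam=0.2)
#          for x0 in np.linspace(X_train.min(), X_train.max(), 200)]
```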

Screenshot taken from Coursera 0:40

So let's see what effect this kernel regression has in practice. And now what we're showing is this yellow curved region, which represents some kernel that we're choosing. In this case, it's called the Epanechnikov kernel, and what we see are some set of red highlighted observations for our given target point, in this case called $x_0$.

And I want to emphasize that we didn't set a fixed number of observations to highlight as red like we did in our k-NN. Here what we did is we chose a kernel with a given, what's called, bandwidth, that's the lambda parameter we discussed earlier, and that defines a region in which our observations are gonna be included in our weighting. Because, in this case, when we're looking at this Epanechnikov kernel, this kernel has bounded support, and that's in contrast to, for example, the Gaussian kernel. And what that means is that there's some region in which points will have these decaying weights, and then outside this region the observations are completely discarded from our fit at this one point.

So, what we do is, at every target point $x_0$ in our input space, we weight our observations by this kernel. We compute this weighted average, and we say that is our predicted value. We do that at each point in our input space to carve out this green line as our predicted fit.

So the result of this kernel regression isn't very different from what the fit would look like from weighted k-NN. It's just that in this case, instead of specifying k, we're specifying a region, based on this kernel bandwidth, for weighting the observations to form this fit. But what we see is that for this kernel regression fit, which, like I've alluded to, should look fairly similar to our weighted k-NN, things look a lot smoother than our standard k-NN fit.

Screenshot taken from Coursera 2:00

So there are two important questions when doing kernel regression. One is which kernel to use, and the other is, for a given kernel, what bandwidth should I use? But typically the choice of the bandwidth matters much more than the choice of kernel. So to motivate this, let's just look again at this Epanechnikov kernel with a couple different choices of bandwidths.

And what we see is that the fit dramatically changes. Here, for example, we get a much wilder, overfit function. Here, for this lambda value, things look pretty reasonable. But when we choose a bandwidth that's too large, we start to get oversmoothing. So this is an oversmoothed fit to the data; I'm not making very good predictions. So we can think of this in terms of the bias-variance trade-off. This lambda parameter is controlling how much we're fitting our observations. A small bandwidth has low bias but high variance: it's very sensitive to the observations, so if I change the observations, I get a dramatically different fit for kernel regression. Whereas over here, for a very large bandwidth, it's the opposite. I have very low variance, as I'm changing my data I'm not gonna change the fit very much, but high bias: significant differences relative to the true function, which is shown in blue.

But just to show how we're fairly insensitive to the choice of kernel, here in the middle plot I'm just gonna change the kernel from this Epanechnikov to our boxcar kernel. The boxcar kernel, instead of having weights that decay with distance, has a fixed set of weights, but we only include observations within some fixed window, just like the Epanechnikov kernel. So this boxcar kernel starts to look very, very similar to our standard k-NN, but again, instead of fixing what k is, you're fixing a neighborhood about your observation. And what you see is that, although this boxcar window has the types of discontinuities we saw with k-NN, because observations are either in or they're out, the fit looks fairly similar to our Epanechnikov kernel. So this is why we're saying that the choice of bandwidth has a much larger impact than the choice of kernel.

Screenshot taken from Coursera 5:00

So this leads to the next important question of how we're gonna choose our bandwidth parameter lambda that we said matters so much. Or, when we're talking about k-NN, the equivalent parameter we have to choose is k, the number of nearest neighbors we're gonna look at. And again, for k-NN, I just wanna mention this now, we saw a similar kind of bias variance trade-off, where for one nearest neighbor we saw these crazy, wildly overfit functions, but once we got to some larger k value, we had much more reasonable and well behaved fits.

So again, we have the same type of bias variance trade-off for that parameter as well. And so the question is, how are we choosing these tuning parameters for the methods that we're looking at? Well, it's the same story as before, so we don't have to go into the lengthy conversations that we've had in past modules, and hopefully you know the answer is cross validation. Or, using some validation set, assuming you have enough data to do that.
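
For concreteness, here is a minimal validation-set version of that idea for picking the bandwidth, assuming the kernel_regression and gaussian_kernel sketches from above and a held-out (X_valid, y_valid) split; the grid of candidate lambdas is arbitrary. The same loop with k in place of lambda would tune k-NN.

```python
import numpy as np

def validation_rss(lam, X_train, y_train, X_valid, y_valid, kernel):
    """Residual sum of squares on the validation set for a given bandwidth."""
    preds = np.array([kernel_regression(x, X_train, y_train, kernel, lam)
                      for x in X_valid])
    return np.sum((y_valid - preds) ** 2)

# pick the bandwidth with the lowest validation error
# lambdas = np.logspace(-2, 1, 20)
# best_lam = min(lambdas, key=lambda lam: validation_rss(lam, X_train, y_train,
#                                                        X_valid, y_valid, gaussian_kernel))
```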

Screenshot taken from Coursera 6:00

4-2) Global fits of parametric models vs. local fits of kernel regression

At the beginning of this module, we talked about this idea of fitting globally versus fitting locally. Now that we've seen k nearest neighbors and kernel regression, I wanna formalize this idea.

So in particular, let's look at what happens when we just fit a constant function to our data. In that case, that's just computing what's called a global average, where we take all of our observations, add them together, and divide by the total number of observations. That's exactly equivalent to summing over a weighted set of our observations, where the weights are exactly the same on each of our data points, and then dividing by the total sum of these weights.
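
Written out, that equivalence is just $$\hat y_q = \frac{1}{N}\sum_{i=1}^N y_i = \frac{\sum_{i=1}^N c\, y_i}{\sum_{i=1}^N c}$$ for any constant weight c > 0 placed on every observation.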

So now that we've put our global average in this form, things start to look very similar to the kernel regression ideas that we've looked at. Here it's almost like kernel regression, but we're including every observation in our fit, and we're having exactly the same weights on every observation. So that's like using a boxcar kernel that puts the same weights on all observations, and having a really, really massively large bandwidth parameter, such that for every point in our input space, all the other observations are gonna be included in the fit.

Screenshot taken from Coursera 1:00

But now let's contrast that with a more standard version of kernel regression, which leads to what we're gonna think of as locally constant fits. Because if we look at the kernel regression equation, what we see is that it's exactly what we had for our global average, but now it's gonna be weighted by this kernel. And in a lot of cases, what that kernel is doing is putting a hard limit, so that observations outside of our window around whatever target point we're looking at are out of our calculation.

So the simplest case we can talk about is this boxcar kernel, that's gonna put equal weights over all observations, but just local to our target point $x_0$. And so, we're gonna get a constant fit, but just at that one target point, and then we're going to get a different constant fit at the next target point, and the next one, and the next one.

And I want to be clear that the resulting output isn't a staircase kind of function. It's not a collection of these constant fits drawn over whole regions. It is a collection of the constant fits, but each one evaluated just at a single point, its target point. So we're taking a single point, doing a constant fit, keeping only the value at that target, then doing another constant fit at the next point, and as we do this over all our different inputs, that's what's defining this green curve.

Okay, but if we look at another kernel, like our Epanechnikov kernel that has the weights decaying over this fixed region, well, it is still doing a constant fit, but how is it figuring out what the level of that line should be at our target point? What it's doing is downweighting observations that are further from our target point and emphasizing more heavily the observations closer to our target point. So this is just a weighted average, but it's no longer global, it's local, because we're only looking at observations within this defined window. So we're doing this weighted average locally at each one of our input points and tracing out this green curve.

So this hopefully makes very clear how, before, in the types of linear regression models we were talking about, we were doing these global fits, which in the simplest case was just a constant model. That was the most basic model we could consider, having just the constant feature. And now what we're talking about is doing exactly the same thing, but locally, and so locally that it's at every single point in our input space.

Screenshot taken from Coursera 3:00

So this kernel regression method that we've described so far, we've now motivated as fitting a constant function locally at each observation, well, more than each observation, at each point in our input space. And this is referred to as locally weighted averages. But instead of fitting a constant at each point in our input space, we could have likewise fit a line or a polynomial. And what this leads to is something that's called locally weighted linear regression.

Screenshot taken from Coursera 4:00

We are not going to go through the details of locally weighted linear regression in this module. It's fairly straightforward: it's a similar idea to these local constant fits, but now plugging in a line or polynomial. But I wanted to leave you with a couple rules of thumb for which fit you might choose among the different polynomial options you have.

  • One thing that fitting a local line instead of a local constant helps you with is those boundary effects that we talked about before, the fact that you get these large biases at the boundary. So you can show very formally that these local linear fits help with that bias.
  • If we talk about local quadratic fits, those help with the bias that you get at points of curvature in the interior of the input space. So, for example, looking at the blue curve we've been trying to fit, and maybe it's worth quickly jumping back to what our fit looked like, we see that towards the boundary we get large biases, and right at the point of curvature we also have a bias, where we're underfitting the true curvature of that blue function. So the local quadratic fit helps with fitting that curvature, but it actually leads to a larger variance, and that can be unattractive.

So in general, a basic recommendation is to use standard local linear regression, fitting lines at every point in the input space; a sketch of what one such local fit looks like follows.
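
Since the module leaves out the details, here is one common way a local linear fit at a single target point is set up, sketched as weighted least squares with the kernels from above. This is an assumption about the formulation, not the course's implementation, and it skips edge cases such as a singular weighted design matrix.

```python
import numpy as np

def local_linear_predict(x_query, X_train, y_train, kernel, lam):
    """Fit a kernel-weighted least squares line around x_query and evaluate it there."""
    w = kernel(np.abs(X_train - x_query), lam)               # weights around the target point
    A = np.column_stack([np.ones_like(X_train), X_train])    # design matrix [1, x]
    W = np.diag(w)
    # solve the weighted normal equations (A^T W A) beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_train)
    intercept, slope = beta
    return intercept + slope * x_query
```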

Screenshot taken from Coursera 5:00

5) k-NN and kernel regression wrapup

5-1) Performance of NN as amount of data grows

So now let's step back and discuss some important theoretical and practical aspects of K-nearest neighbors and kernel regression.

If you remember, the title of this module was Going Nonparametric, and we've yet to mention what that means. What is a nonparametric approach? Well, k-nearest neighbors and kernel regression are examples of nonparametric approaches. And the general goal of a nonparametric approach is to be really flexible in how you're defining f(x), and in general you want to make as few assumptions as possible. The real key that defines a nonparametric method is that the complexity of the fit can grow as you get more data points. We've definitely seen that with k-nearest neighbors and kernel regression; in particular, the fit is a function of how many observations you have. But these are just two examples of nonparametric methods you might use for regression. There are lots of other choices, things like splines, and trees, which we'll talk about in the classification course, and locally weighted structured versions of the types of regression models we've talked about.

Screenshot taken from Coursera 1:00

So nonparametrics is all about this idea of having the complexity grow with the number of observations. So now let's talk about what's the limiting behavior of nearest neighbor regression as you get more and more data. And to start with, let's just assume that we get completely noiseless data. So every observation we get lies exactly on the true function. Well in this case, the mean squared error of one nearest neighbor regression goes to zero, as you get more and more data.

But let's just remember what mean squared error is. If you remember, from a couple modules ago we talked about this bias-variance trade-off, and that mean squared error is the sum of bias squared plus variance. So having mean squared error go to zero means that both bias and variance are going to zero.
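
In symbols, the decomposition being referenced here is simply $$\text{MSE} = \text{bias}^2 + \text{variance}$$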

Screenshot taken from Coursera 1:30

So to motivate this visually, let's just look at a couple of movies. Here, in this movie, I'm showing what the one nearest neighbor fit looks like as we're getting more and more data. So, remember, the blue line is our true curve. The green line is our current nearest neighbor fit based on some set of observations that lie exactly on that blue curve, the true function. Okay, so here's our fit changing as we get more and more data, and what you see is that it's getting closer, and closer, and closer to the true function. And hopefully you can believe that in the limit of getting an infinite number of observations spread over our input space, this nearest neighbor fit is gonna lie exactly on top of the true function. And that's true for any possible data set with an infinite number of observations that we could get.

In contrast, if we look at what happens just doing a standard quadratic fit, just the standard least squares fit we talked about before, no matter how much data we get, there's always gonna be some bias. We can see this here, where, especially at this point of curvature, we see that this green fit, even as we get lots and lots of observations, is never matching up to that true blue curve. And that's because that true blue curve is actually part of a sinusoid; we've just zoomed in on a certain region of that sinusoid. And so this quadratic fit is never exactly gonna describe what a sinusoid is describing.

Screenshot taken from Coursera 3:00

So this is what we talked about before, about the bias that's inherent in our model, even if we have no noise. So even if we eliminate the noise, we still have the fact that our true error, as we're getting more and more data, is never gonna go exactly to zero, unless of course the data were generated from exactly the model that we're using to fit the data. But in most cases, for example, maybe you have data that looks like the following.

So this is our noiseless data, and remember, this was all for fixed model complexity. If we constrain our model to, say, be a quadratic, then maybe this would be our best quadratic fit. And no matter how many observations I give you from this more complicated function, this quadratic is never gonna have zero bias. In contrast, let's switch colors here so that we can draw our one nearest neighbor fit. Our one nearest neighbor, as we get more and more data, is fitting these constants locally to each observation. And as we get more and more and more data, the fit is gonna look exactly like the true curve.

And so when we talk about our true error with an increasing number of observations, our true error for one nearest neighbor (this is a plot of true error for one nearest neighbor) is going to go to zero for noiseless data.

Screenshot taken from Coursera 5:00

But now let's talk about the noisy case. This is the case that we're typically faced with in most applications. And in this case, what you can say is that the mean squared error of nearest neighbor regression goes to zero if you allow the number of neighbors, the k in our nearest neighbor regression, to increase with the number of observations as well. Because if you think about getting tons and tons of observations, if you keep k fixed, you're just gonna be looking at a very, very local region of your input space, and you're gonna have a lot of noise introduced from that. But if you allow k to grow, it's gonna smooth over the noise that's being introduced.

So let's look at a visualization of this.

  • So here what we're showing is the same true function we've shown throughout this module, but we're showing tons of observations, all these dots. But they're noisy observations, they're no longer lying exactly on that blue curve. That's why they're this cloud of blue points. And we see that our one nearest neighbor fit is very, very noisy. It has this wild behavior because, like we discussed before, one nearest neighbor is very sensitive to noise in the data.
  • But in contrast, if we look at a large k, here we're looking at k equals 200, so our 200 nearest neighbors fit, it looks much, much better. So you can imagine that as you get more and more observations, if you're allowing k to grow, you can smooth over the noise being introduced by each one of these observations and have the mean squared error going to zero.
  • But in contrast, again, if we look at just our standard least squares regression, here in this case a quadratic fit, we're always gonna have bias. Nothing's different by having introduced noise; if anything, it will just make things worse.

Screenshot taken from Coursera 6:00

5-2) Issues with high-dimensions, data scarcity, and computational complexity

So, we've talked about the performance of nearest neighbor regression, as you get tons and tons of observations. And things look pretty good in that case. Actually, amazingly good, but the picture isn't quite as nice when we start increasing the dimension of our input space or if we just have a limited set of observations. Because the performance of nearest neighbors regression, as we talked about before, is very sensitive to how well the data covers the input space. If you remember we had this plot early on where we just removed observations from some point of our input space and we got a pretty bad 1-Nearest neighbor fit.

Well, now imagine you have a really huge, really high dimensional space that you're looking at, and you need your observations to cover all points within this space. Well, you're gonna need a number of observations that is exponentially large in the dimension of the space you're looking at in order to cover this space and have good nearest neighbor performance. And that's typically really hard to do, to have as many observations as you would need in these high-dimensional spaces.

So the cases where you have limited data relative to the dimension of the input space you're looking at are where the parametric models that we've talked about throughout the rest of this course become so important.

Screenshot taken from Coursera 1:00

And another thing that's important to talk about is the complexity of our nearest neighbor search: the naive type of algorithm that we've talked about so far, which is just a search through our entire data set, can be quite computationally intensive.

So, for example, if you have just one query point, $x_q$, and you're just doing one nearest neighbor search, you have to scan through every single observation in your data set and compute the distance to each one of those observations. So, going from our query house to each one of the different houses in our data set, computing the similarity between these houses, in order to find the nearest neighbor. So that has complexity that's linear in the number of observations that we have, O(N).

Or if we wanna do our k nearest neighbors approach, we have to maintain a list of the k nearest neighbors, and if you do this with the sorted queue that we've talked about, then you can do this search in O(N log k). But both of these, either for our 1-NN or our k-NN, have complexity linear in the number of observations we have.

But what if N is huge? Because this is the situation where we've said these k-NN approaches work so well. Well, that can be really, really intensive to do, and especially if you have to do many queries. So, if you wanna predict the values at many points in your input space, this can be a really, really intensive procedure. So, instead, we're gonna talk about more efficient methods for doing this type of nearest neighbor search in our Clustering and Retrieval course later in this specialization.

Screenshot taken from Coursera 2:00

5-3) k-NN for classification

So we've talked about using k-NN for regression, but these methods can also be very, very straightforwardly used for classification. So this is a little warm-up for the next course in this specialization, which is our classification course.

And let's start out by just recalling our classification task, where we're gonna do this in the context of spam filtering. We have some email as our input, and the output is gonna be whether the email is spam or not spam. And we're gonna make this decision based on the text of the email, and maybe information about the sender IP and things like this.

Screenshot taken from Coursera 0:30

Well, what we can do is use k-NN for classification. Visually, we can think about just taking all of the emails that we have labeled as spam or not spam and throwing them down in some space, where the distance between emails in this space represents how similar the text or the sender IP information is, all the inputs or features we're using to represent these emails. And then what we can do is, we get some query email that comes in, so that's this little gray email here, and we're gonna say, is it spam or not spam? There's a very intuitive way to do this, which is to just search for the nearest neighbors of this email, the emails most similar to this email. And then we're just gonna do a majority vote on the nearest neighbors to decide whether this email is spam or not spam.

And what we see in this case, is that four of the neighbors are spam, and only one neighbor is not spam, so we're gonna label this email as spam. And so this is the really, really straightforward approach of using k-NN for classification.
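
Here is a minimal Python sketch of that majority vote. The string labels and the way the email features and distance are represented are placeholders for illustration; it simply sorts all training emails by distance to the query and votes over the top k.

```python
from collections import Counter

def knn_classify(x_query, X_train, labels, k, distance):
    """Majority vote over the labels of the k nearest neighbors (e.g., 'spam' / 'not spam')."""
    scored = sorted(((distance(x_i, x_query), label)
                     for x_i, label in zip(X_train, labels)), key=lambda pair: pair[0])
    top_k_labels = [label for _, label in scored[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]
```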

Screenshot taken from Coursera 1:00

5-4) A brief recap

Screenshot taken from Coursera 1:00